Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study

Smolyak, Daniel, Welivita, Arshana, Bjarnadóttir, Margrét V., Agarwal, Ritu

arXiv.org Artificial Intelligence

Objective. Demographic groups are often represented at different rates in medical datasets. These differences can create bias in machine learning algorithms, with higher levels of performance for better-represented groups. One promising solution to this problem is to generate synthetic data to mitigate the potential adverse effects of non-representative datasets. Methods. We build on recent advances in LLM-based synthetic data generation to create a pipeline in which synthetic data is generated separately for each demographic group. We conduct our study using the MIMIC-IV and Framingham "Offspring and OMNI-1 Cohorts" datasets. We prompt GPT-4-Turbo to create group-specific data, providing training examples and the dataset context. An exploratory analysis is conducted to ascertain the quality of the generated data. We then evaluate the utility of the synthetic data for augmenting a training dataset in a downstream machine learning task, focusing specifically on model performance metrics across groups. Results. The performance of GPT-4-Turbo augmentation is generally, but not always, superior. In the majority of experiments our method outperforms standard modeling baselines; however, prompting GPT-4-Turbo to produce data specific to a group provides little to no additional benefit over a prompt that does not specify the group. Conclusion. We developed a method for using LLMs out of the box to synthesize group-specific data to address imbalances in demographic representation in medical datasets. As another "tool in the toolbox," this method can improve model fairness and thus health equity. More research is needed to understand the conditions under which LLM-generated synthetic data is useful for non-representative medical datasets.
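The group-specific prompting step the abstract describes can be sketched as a prompt-construction helper. This is an illustrative assumption, not the paper's actual prompt: the function name, message wording, and example schema are invented for the sketch, and the resulting messages list would be sent to a chat-completion API (e.g. GPT-4-Turbo) whose response is parsed back into rows and appended to the under-represented group's training split.

```python
import json

def build_group_prompt(group, examples, context):
    """Assemble a chat prompt asking an LLM to generate synthetic
    records for one demographic group. All wording here is a
    hypothetical stand-in for the paper's prompt."""
    system = "You are a clinical data generator. Dataset context: " + context
    user = (
        f"Generate 5 new, realistic patient records for the group '{group}'. "
        "Match the schema and value ranges of these training examples:\n"
        + "\n".join(json.dumps(ex) for ex in examples)
        + "\nReturn one JSON object per line."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Build a prompt for one demographic group with one few-shot example.
messages = build_group_prompt(
    "female, age 60-70",
    [{"age": 64, "sbp": 138, "chol": 210}],
    "Framingham-style cardiovascular risk records",
)
```

Generating each group's data in a separate call, rather than one mixed call, is what lets the pipeline oversample exactly the groups that are under-represented.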


COSBO: Conservative Offline Simulation-Based Policy Optimization

Kargar, Eshagh, Kyrki, Ville

arXiv.org Artificial Intelligence

Offline reinforcement learning allows training reinforcement learning models on data from live deployments. However, it is limited to choosing the best combination of behaviors present in the training data. In contrast, simulation environments attempting to replicate the live environment can be used instead of the live data, yet this approach is limited by the simulation-to-reality gap, resulting in a bias. In an attempt to get the best of both worlds, we propose a method that combines an imperfect simulation environment with data from the target environment to train an offline reinforcement learning policy. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches CQL, MOPO, and COMBO, especially in scenarios with diverse and challenging dynamics, and demonstrates robust behavior across a variety of experimental conditions. The results highlight that using simulator-generated data can effectively enhance offline policy learning despite the sim-to-real gap, when direct interaction with the real world is not possible.
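The core idea of combining the two data sources can be illustrated with a simple batch-mixing routine. This is a minimal sketch under stated assumptions: the function name, the flat transition format, and the 50/50 default ratio are all illustrative inventions, not COSBO's actual algorithm, which additionally applies conservatism to counteract the sim-to-real bias.

```python
import random

def mixed_batch(real_data, sim_data, batch_size, real_fraction=0.5, rng=None):
    """Sample a training batch that mixes real-environment transitions
    with simulator-generated ones in a fixed proportion."""
    rng = rng or random.Random(0)
    n_real = round(batch_size * real_fraction)
    # Draw with replacement from each source, then shuffle so the
    # learner sees an interleaved batch.
    batch = rng.choices(real_data, k=n_real) + \
            rng.choices(sim_data, k=batch_size - n_real)
    rng.shuffle(batch)
    return batch
```

Tagging each transition with its source (real vs. simulator) is also useful downstream: a conservative method can then penalize value estimates on simulator transitions, which is the kind of mechanism needed to keep the policy from exploiting simulator errors.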


On the Generalization of Diffusion Model

Yi, Mingyang, Sun, Jiacheng, Li, Zhenguo

arXiv.org Artificial Intelligence

Diffusion probabilistic generative models are widely used to generate high-quality data. Though they can synthesize data that does not exist in the training set, the rationale behind such generalization is still unexplored. In this paper, we formally define the generalization of a generative model, measured by the mutual information between the generated data and the training set. The definition originates from the intuition that a model which generates data with less correlation to the training set exhibits better generalization ability. Meanwhile, we show that for the empirically optimal diffusion model, the data generated by a deterministic sampler are all highly related to the training set, implying poor generalization. This result contradicts the observed extrapolation ability (generating unseen data) of trained diffusion models, which approximate the empirical optimum. To understand this contradiction, we empirically verify the difference between a sufficiently trained diffusion model and the empirical optimum. We find that, even after sufficient training, a slight difference between them remains, and this difference is critical to making the diffusion model generalizable. Moreover, we propose another training objective whose empirically optimal solution has no potential generalization problem. We empirically show that the proposed training objective returns a model similar to the original one, which further verifies the generalization ability of the trained diffusion model.
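The intuition behind the definition — generations that sit on top of training points indicate memorization rather than generalization — can be probed with a crude nearest-neighbor distance check. To be clear, the paper's actual measure is mutual information, not this; the helper below is only an illustrative proxy, and the toy points are invented for the sketch.

```python
import math

def min_train_distance(sample, train_set):
    """Euclidean distance from a generated sample to its nearest
    training point. Distance ~0 suggests the sample was memorized;
    a clearly positive distance suggests a novel generation."""
    return min(math.dist(sample, x) for x in train_set)

train = [(0.0, 0.0), (1.0, 1.0)]
memorized = min_train_distance((0.0, 0.0), train)  # a training point itself
novel = min_train_distance((0.5, 0.5), train)      # off the training set
```

Under the paper's framing, a deterministic sampler driven by the empirically optimal model would score near zero on a check like this for every sample, while a trained (slightly non-optimal) model produces samples with positive distances.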


Gan vs Diffusion Models -- What's the Difference?

#artificialintelligence

With the potential to revolutionize a wide range of industries, AI is bound to have a major impact on our lives in the coming years. However, before we can grasp the full potential of AI, we need to understand its foundations -- in other words, diffusion models and generative adversarial networks (GANs). In a nutshell, a diffusion model is a machine-learning model that generates data by learning to reverse a gradual noising process. A GAN, on the other hand, is a model that can generate images, sounds, or text autonomously. So which one is better?


A New Method to Generate Data for Training Autonomous Vehicles

#artificialintelligence

It goes without saying that an autonomous vehicle (AV) must be able to accurately track the movement of pedestrians, animals, bicycles, and other vehicles around it to get safely and effectively from point A to B. The systems responsible for doing this depend on being fed data, among other things, on which they are "trained" and learn to spot and react to these obstacles and hazards. A technique developed by Carnegie Mellon University (CMU) researchers, called "scene flow," may be able to deliver improved results by training systems on larger datasets. Generally speaking, the more data that is available for training tracking systems, the better the results will be. And, according to the CMU researchers, they have found a way to unlock a "mountain" of autonomous driving data for exactly that purpose. Most AVs navigate based on sensor data from light detection and ranging (lidar) systems that scan the environment to generate three-dimensional information about the world surrounding the vehicle.


Top 10 Synthetic Data Startups Making a Mark in the Tech Sphere

#artificialintelligence

Designing good data-driven models depends hugely on the quality of data. Data may just be a set of numbers, but, as they say, the devil lies in the details: real data comes with issues like imbalanced classes, inherent biases, and unstructured values. Synthetic data, on the other hand, gives developers the flexibility to scale data and freedom from biases, opening up a whole lot of possibilities for creating models from data that doesn't exist in the real world. In addition, synthetic data helps protect user data privacy while giving developers the freedom to experiment.


Startups Seek High-Tech Solutions For Massive Food Waste

#artificialintelligence

Startups and venture capital are pouring into what might seem an unlikely place: India's vast, outdated agriculture industry. Seizing on controversial new deregulation, entrepreneurs are selling farmers apps to connect them to big buyers nationwide and using artificial intelligence (AI) to improve the rickety supply chains that lose one-fourth of India's produce to wastage. Enormous amounts of India's grain, fruit and vegetables rot between farm and table because of manual handling, repeated loading and unloading, poor inventory management, lack of adequate storage and slow movement of goods. This rate of wastage from faulty supply chains is four to five times that of most large economies, experts say. Prime Minister Narendra Modi's government introduced changes it calls a watershed that will "remove middlemen and let farmers sell their produce directly to buyers," improving their prospects, especially in far-flung areas.


Why Does Image Data Augmentation Work As A Regularizer in Deep Learning?

#artificialintelligence

The problem with deep learning models is that they need lots of data to train. Two major problems when training deep learning models are overfitting and underfitting. These problems are addressed by data augmentation, a regularization technique that makes slight modifications to the images and uses them to generate additional data. In this article, we demonstrate why data augmentation is considered a regularization technique, how to apply it to a model, and whether it serves as a preprocessing or post-processing technique.
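The "slight modifications" the article mentions can be shown with two label-preserving transforms on a tiny image represented as a list of rows. The helper names are illustrative; in practice a library such as torchvision or Keras supplies equivalent transforms.

```python
def hflip(img):
    """Horizontal flip: reverse each row of a 2-D image (list of lists)."""
    return [row[::-1] for row in img]

def shift_right(img, k, fill=0):
    """Shift pixels k columns to the right, padding the left edge with `fill`."""
    return [[fill] * k + row[:-k] if k else row[:] for row in img]

def augment(img):
    """Return the original image plus slightly modified copies.
    Training on all of them exposes the model to label-preserving
    variation it never saw verbatim, which acts like a regularizer."""
    return [img, hflip(img), shift_right(img, 1)]
```

Because each augmented copy keeps the same label, the model is discouraged from memorizing exact pixel positions, which is precisely how augmentation reduces overfitting.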


Train Generative Adversarial Network (GAN) - MATLAB & Simulink

#artificialintelligence

This example shows how to train a generative adversarial network (GAN) to generate images. A generative adversarial network (GAN) is a type of deep learning network that can generate data with characteristics similar to the input real data. Generator -- Given a vector of random values (latent inputs) as input, this network generates data with the same structure as the training data. Discriminator -- Given batches of data containing observations from both the training data and generated data from the generator, this network attempts to classify the observations as "real" or "generated". Train the generator to generate data that "fools" the discriminator.
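The adversarial objective described above (the original example is in MATLAB) can be sketched in Python as the two binary cross-entropy losses computed from the discriminator's raw scores. The function names are illustrative; scores are pre-sigmoid logits, as is standard.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_loss(real_scores, fake_scores):
    """Discriminator loss: push real scores toward 'real' (sigmoid -> 1)
    and generated scores toward 'generated' (sigmoid -> 0)."""
    real = -sum(math.log(sigmoid(s)) for s in real_scores) / len(real_scores)
    fake = -sum(math.log(1 - sigmoid(s)) for s in fake_scores) / len(fake_scores)
    return real + fake

def g_loss(fake_scores):
    """Generator loss: reward generated samples the discriminator
    scores as real -- this is what 'fooling' the discriminator means."""
    return -sum(math.log(sigmoid(s)) for s in fake_scores) / len(fake_scores)
```

Training alternates: one step minimizing `d_loss` over discriminator parameters, then one step minimizing `g_loss` over generator parameters, so the two networks improve against each other.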


How to generate data for machine learning – Bits&Chips

#artificialintelligence

Jan Bosch is a research center director, professor, consultant and angel investor in start-ups. You can contact him at jan@janbosch.com. In recent columns, I've been sharing my view on the quality of the data that many companies have in their data warehouses, lakes or swamps. In my experience, most of the data that companies have stored so carefully is useless and will never generate any value for the company. The data that actually is potentially useful tends to require vast amounts of preprocessing before it can be used for machine learning, for example.